JCO Clinical Cancer Informatics — Latest Matching Preprints

1

A Reproducible Health Informatics Pipeline for Simulating and Integrating Early-Phase Oncology Clinical, Biomarker, and Pharmacokinetic Data for Exploratory Decision-Support Analytics

Petalcorin, M. I. R.

2026-04-02 health informatics 10.64898/2026.03.27.26349538 medRxiv

Top 0.1%

44.8%

Background: Early-phase oncology development increasingly depends on integrated interpretation of clinical outcomes, translational biomarkers, and pharmacokinetic exposure rather than toxicity alone. This shift has created a need for reproducible analytical workflows that can combine heterogeneous trial data into traceable, analysis-ready outputs suitable for exploratory review and early decision support. Objective: To develop a reproducible Python-based workflow that simulates a plausible early-phase oncology study, integrates clinical, biomarker, and pharmacokinetic data, and generates analysis-ready datasets, visual summaries, and exploratory predictive models relevant to early development analytics. Methods: A workflow was constructed to simulate an early-phase oncology cohort of 120 patients distributed across multiple dose levels. Three synthetic raw data sources were generated, including patient-level clinical data, baseline biomarker data, and longitudinal pharmacokinetic profiles. These sources were merged into a single analysis-ready dataset containing derived variables such as tumor percent change from baseline, clinical-benefit status, exposure summaries, adverse-event indicators, and survival outcomes. The workflow produced structured tables, patient listings, waterfall plots, Kaplan-Meier-style survival curves, biomarker-response visualizations, pharmacokinetic profile plots, and exploratory machine-learning outputs. Results: The final integrated dataset contained 120 patients and 30 variables. Median survival across the simulated cohort was 243.8 days, and higher dose groups showed improved median survival and greater clinical benefit relative to the low-dose group. Clinical benefit increased from 8.6% in the low-dose group to 29.0% in the medium-dose group and 45.2% in the high-dose group. 
Higher baseline LDH, CRP, and ctDNA fraction tracked with less favorable tumor-response trajectories, whereas higher exposure, reflected by AUC and Cmax, associated with improved disease control. Pharmacokinetic profiles showed clear dose-dependent separation. Grade 3 or higher adverse-event rates remained within a plausible exploratory range across dose groups. A random-forest model for clinical benefit achieved an exploratory ROC AUC of 0.845, while a logistic-regression model for strict responder status could not be fit because no simulated patient met the prespecified objective response threshold. Conclusions: This proof-of-concept demonstrates that a transparent Python workflow can generate a coherent early-phase oncology analytical ecosystem from synthetic inputs. The workflow supports integration of heterogeneous data streams, derivation of analysis-ready variables, production of interpretable outputs, and exploratory modeling in a reproducible framework. Although the simulated responder prevalence was too low to support objective response modeling, this limitation itself highlights the importance of simulation calibration for downstream analytical validity. The framework provides a practical Health Informatics demonstration of how early oncology trial data can be structured and analyzed for exploratory translational decision support.
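
The derived variable "tumor percent change from baseline" named in this abstract can be illustrated with a minimal sketch. The patient IDs, measurements, and function name are hypothetical and are not taken from the preprint's code.

```python
# Hypothetical (baseline, best on-study) tumor sizes in mm; illustrative only.
tumor_mm = {
    "P001": (50.0, 35.0),
    "P002": (40.0, 44.0),
    "P003": (60.0, 60.0),
}


def percent_change(baseline: float, on_study: float) -> float:
    """Percent change from baseline; negative values mean shrinkage."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return 100.0 * (on_study - baseline) / baseline


changes = {pid: percent_change(b, f) for pid, (b, f) in tumor_mm.items()}

# A waterfall plot orders patients from largest increase to largest decrease.
waterfall_order = sorted(changes, key=changes.get, reverse=True)
```

Sorting before plotting is what gives a waterfall chart its characteristic descending shape.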

2

An End-to-End Synthetic Oncology Clinical Trial Framework Integrating Radiographic Response, Circulating Tumor DNA, Safety, and Survival for Decision-Oriented Clinical Data Science

Petalcorin, M. I. R.

2026-04-08 health informatics 10.64898/2026.04.07.26350297 medRxiv

Top 0.1%

28.2%

Background: Modern oncology development depends on integrating radiographic response, molecular biomarkers, treatment exposure, safety, and survival endpoints, yet access to well-structured patient-level trial data is often limited. Methods: We developed a synthetic, literature-informed phase II randomized oncology trial framework that followed the sequence Patient -> Data -> Dataset -> Analysis -> Tables/Figures -> Decision. A cohort of randomized patients was simulated with baseline demographic and disease features, longitudinal tumor measurements, circulating tumor DNA, inflammatory and exploratory biomarkers, adverse events, treatment exposure, and survival outcomes. Raw source datasets were transformed into SDTM-like domains and ADaM-like analysis datasets, then analyzed for baseline characteristics, exposure, best overall response, survival, subgroup hazard ratios, longitudinal tumor and biomarker changes, exposure-response, and safety. Results: The treatment arm showed a coherent efficacy signal across multiple analytical layers. Treatment increased objective response and clinical benefit, reduced tumor burden over time, and prolonged survival. Median overall survival increased from 135 days in the control arm to 288 days in the treatment arm, with an approximate hazard ratio of 0.661 (95% CI, 0.480-0.911; p = 0.011). Median progression-free survival increased from 116 to 208 days, with an approximate hazard ratio of 0.601 (95% CI, 0.418-0.864; p = 0.006). Circulating tumor DNA showed a more favorable trajectory in treated patients and aligned directionally with radiographic and survival benefit. Safety analyses showed increased treatment-related toxicity, but the overall safety profile remained interpretable and compatible with continued development. 
Conclusions: This study demonstrates that a synthetic, literature-informed oncology trial can reproduce a biologically plausible and analytically coherent efficacy-safety signal architecture across radiographic, molecular, and time-to-event endpoints, providing a decision-oriented prototype for translational oncology clinical data science. Keywords: synthetic clinical trial, oncology, ctDNA, Kaplan-Meier, biomarker, survival analysis, translational data science, ADaM, SDTM
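
As a rough consistency check on the hazard ratios reported above, a two-sided p-value can be back-calculated from a hazard ratio and its 95% confidence interval. The sketch assumes the interval is a symmetric Wald interval on the log scale, which the abstract does not state; the function name is hypothetical.

```python
import math


def p_from_hr_ci(hr: float, lo: float, hi: float, z_crit: float = 1.959964) -> float:
    """Two-sided Wald p-value implied by a hazard ratio and its 95% CI."""
    se = (math.log(hi) - math.log(lo)) / (2.0 * z_crit)  # SE of log(HR)
    z = math.log(hr) / se
    return math.erfc(abs(z) / math.sqrt(2.0))  # equals 2 * (1 - Phi(|z|))


p_os = p_from_hr_ci(0.661, 0.480, 0.911)   # overall survival
p_pfs = p_from_hr_ci(0.601, 0.418, 0.864)  # progression-free survival
print(round(p_os, 3), round(p_pfs, 3))
```

Under that assumption the implied values land close to the reported p = 0.011 and p = 0.006.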

3

From Registration to Insight: How STRONG AYA Transforms Registry Data to Enhance Decision-Support Tools for Adolescent and Young Adult Oncology

Hughes, N.; Hogenboom, J.; Carter, R.; Norman, L.; Gouthamchand, V.; Lindner, O.; Connearn, E.; Lobo Gomes, A.; Sikora-Koperska, A.; Rosinska, M.; Pogoda, K.; Wiechno, P.; Jagodzinska-Mucha, P.; Lugowska, I.; Hanebaum, S.; Dekker, A.; van der Graaf, W.; Husson, O.; Wee, L.; Feltbower, R.; Stark, D.

2026-04-04 oncology 10.64898/2026.04.03.26350064 medRxiv

Top 0.1%

23.0%

Background: Population-based cancer registers (PBCR) are important for monitoring trends in cancer epidemiology, facilitating the implementation of effective cancer services. Adolescents and Young Adults (AYA) with cancer are a patient group with a unique set of needs. The utility of PBCR in AYA is limited by the lack of AYA-specific data items. STRONG AYA, an international multidisciplinary consortium, is addressing this through federated learning (FL) methodology and novel data visualisation concepts. A Core Outcome Set (COS) has been developed to measure outcomes of importance through clinical data and Patient Reported Outcomes (PROs). We describe how data from the Yorkshire Specialist Register of Cancer in Children and Young People (YSRCCYP), a PBCR in the UK, are being used within STRONG AYA and how the subsequent analyses can guide patient consultations. Methods: Data from the YSRCCYP were imported into a Vantage 6 node, from which FL analyses are performed along with data provided by other consortium members. The results are extracted into the PROMPT software and integrated into patient electronic healthcare records. Results: Healthcare professionals can view the results of individual PROs at various time points and in comparison to summary analyses carried out within the STRONG AYA infrastructure. Results can be filtered by age, disease, country and stage. Conclusion: We have demonstrated how a regional PBCR can contribute to a pan-European infrastructure and how the resulting analyses can be viewed to enhance patient consultations. Such analyses have the potential to be used for research and policy-making, improving outcomes for AYA.
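
The federated idea described in this abstract, where analyses run at each node and only aggregates leave the site, can be sketched in a few lines. This is a conceptual illustration only: it does not use the Vantage 6 API, and the class, function names, and scores are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NodeSummary:
    """Aggregate statistics a node is willing to share (no patient rows)."""
    n: int
    total: float


def local_summary(scores: list[float]) -> NodeSummary:
    # Runs inside each data node; only the count and sum leave the site.
    return NodeSummary(n=len(scores), total=sum(scores))


def federated_mean(summaries: list[NodeSummary]) -> float:
    # Runs at the coordinating server on the shared aggregates only.
    n = sum(s.n for s in summaries)
    if n == 0:
        raise ValueError("no observations across nodes")
    return sum(s.total for s in summaries) / n


# Hypothetical PRO scores held at two separate registries.
node_a = local_summary([60.0, 70.0, 80.0])
node_b = local_summary([50.0, 90.0])
pooled = federated_mean([node_a, node_b])
```

The pooled mean equals the mean of the combined data even though neither registry's rows were shared.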

4

Onca: An Open 9B Language Model for Pancreatic Cancer Clinical Tasks

Shim, K. B.

2026-04-24 oncology 10.64898/2026.04.16.26351055 medRxiv

Top 0.1%

18.3%

Pancreatic ductal adenocarcinoma (PDAC) remains one of the deadliest solid tumors and continues to face low treatment-trial participation, fragmented evidence workflows, and labor-intensive abstraction of unstructured clinical text. Existing oncology-focused language models show promise, but many depend on private institutional corpora, limiting reproducibility and practical reuse across centers. We present Onca, an open 9B dense model designed for four PDAC-relevant tasks: trial eligibility screening, case-specific clinical reasoning, structured pathology report extraction, and molecular variant evidence reasoning. Onca is fine-tuned from Qwopus3.5-9B-v3 with a single Unsloth BF16 LoRA adapter on 37,364 training rows drawn from openly available sources. The evaluation spans 11 panels and compares Onca against Woollie-7B, CancerLLM-7B, OpenBioLLM-8B, and the unmodified Qwopus base. Onca achieves the strongest overall results on Trial Screening (81.6 F1), Clinical Reasoning (14.1 composite), Pathology Extraction (30.5 field exact-match), PubMedQA Cancer (68.3 macro-F1), and PubMedQA (66.5 macro-F1). The strongest gains appear in tasks closest to routine oncology workflow, especially trial review and pathology structuring. These findings suggest that clinically targeted pancreatic-cancer language models can be built from open data with competitive performance while remaining practical to train on a single workstation-scale GPU setup.

5

SCOPE: Integrating Organoid Screening and Clinical Variables Through Machine Learning for Cancer Trial Outcome Prediction

Bouteiller, J.; Gryspeert, A.-R.; Caron, J.; Polit, L.; Altay, G.; Cabantous, M.; Pietrzak, R.; Graziosi, F.; Longarini, M.; Schutte, K.; Cartry, J.; Mathieu, J. R.; Bedja, S.; Boileve, A.; Ducreux, M.; Pages, D.-L.; Jaulin, F.; Ronteix, G.

2026-04-11 oncology 10.64898/2026.04.10.26350512 medRxiv

Top 0.1%

14.3%

Background: Predicting whether a treatment will demonstrate meaningful clinical benefit before committing to a large-scale trial remains a major unmet need in oncology. Patient-derived organoids (PDOs) recapitulate individual tumor drug sensitivity, but have not been used to forecast population-level trial outcomes. We developed SCOPE (Screening-to-Clinical Outcome Prediction Engine), a platform that integrates PDO drug screening with clinical prognostic modeling to predict arm-level median progression-free survival (mPFS) and objective response rate (ORR) without access to any trial outcome data. Patients and methods: SCOPE was trained on 54 treatment lines from patients with metastatic colorectal cancer (mCRC, n=15) and metastatic pancreatic ductal adenocarcinoma (mPDAC, n=39) with matched clinical data and PDO drug screening across 9 compounds. A Clinical Score module captures baseline prognosis; a Drug Screen Score module quantifies treatment-specific organoid sensitivity. To predict trial outcomes, synthetic patient profiles are generated from published eligibility criteria and matched to a biobank of 81 PDO lines. Predictions were externally validated against 32 arms from 23 published trials, treatment ranking was assessed across 8 head-to-head comparisons, and prospective applicability was tested for daraxonrasib (RMC-6236), a novel pan-RAS inhibitor in mPDAC. Results: Predicted mPFS strongly agreed with published outcomes (R2=0.85, MAE=0.82 months; Pearson r=0.92, P<0.001), approaching the empirical concordance between two independently measured clinical endpoints (ORR vs. mPFS, R2=0.87). ORR prediction was similarly robust (R2=0.71, MAE=7.3 percentage points). Integrating organoid and clinical data significantly outperformed either alone (P=0.001). SCOPE correctly identified the superior arm in 7 of 8 head-to-head comparisons (88%, P<0.05). 
Applied to daraxonrasib prior to phase 3 data availability, the platform predicted superiority over standard chemotherapy in KRAS-mutant mPDAC, consistent with emerging clinical data. Conclusion: By combining functional organoid drug screening with clinical modeling, SCOPE generates calibrated efficacy predictions for both established regimens and novel agents without prior clinical data. This approach could support clinical trial design, treatment arm selection, and go/no-go decisions, offering a new tool to improve the efficiency of gastrointestinal cancer drug development.
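
The abstract reports 7 of 8 correct head-to-head rankings (P<0.05) without naming the test. One plausible reading, assumed here, is a one-sided exact binomial test against a 50% chance rate; the sketch below computes that tail probability.

```python
from math import comb


def binom_tail_ge(k: int, n: int, p: float = 0.5) -> float:
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Chance of 7 or more correct rankings out of 8 under coin-flip guessing.
p_one_sided = binom_tail_ge(7, 8)  # 9/256
```

Under this assumption the tail probability is 9/256, or about 0.035, which is consistent with the reported P<0.05.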

6

A Context-Aware Target Engagement and Pharmacodynamic Biomarker Resource to Accelerate Drug Discovery and Development

Yang, Y.; Zhao, L.; Orouji, S.; Zhu, Y.; Johnson, R. L.; Maxwell, D. S.; Mica, I.; Russell, K. P.; Al-lazikani, B.

2026-04-22 bioinformatics 10.64898/2026.04.19.719411 medRxiv

Top 0.1%

12.2%

Confirming target engagement in tumor experimental models remains a major challenge in oncology drug development. Pharmacodynamic biomarkers can help address this, but few systematic resources link drug targets to candidate biomarkers. We developed TargetTrace, a comprehensive resource to identify and prioritize pharmacodynamic biomarkers across nine key target classes, including transcription factors/cofactors, kinases, phosphatases, ubiquitin ligases, deubiquitinases, acetyltransferases, deacetylases, methyltransferases, and demethylases. Biomarker candidates were gathered from curated molecular interaction resources and refined using external annotations to improve accuracy. For enzyme targets with measurable substrate changes, we applied a two-agent large language model workflow, followed by manual review, to harmonize antibody information from the antibody resources and ensure that the selected biomarkers are measurable with existing laboratory tests. From more than 92,000 input interactions and over 2,300 targets, we compiled 71,323 target-biomarker relationships involving 2,270 potential drug targets, encompassing both transcription factor/cofactor-target gene and enzyme-substrate interactions. Commercial antibodies were available for over 1,400 biomarkers, supporting laboratory validation. TargetTrace provides a structured, reusable resource for systematic identification and prioritization of pharmacodynamic biomarkers in oncology.

7

Aakhyan: An AI-Powered Vernacular Patient Communication Platform for Oncology in Resource-Limited Settings - System Architecture and Pilot Randomised Trial Protocol

Purkayastha, D. S.

2026-04-17 health informatics 10.64898/2026.04.15.26350965 medRxiv

Top 0.1%

9.2%

Inadequate discharge communication is a well-documented contributor to medication non-adherence, missed follow-ups, and preventable readmissions across healthcare systems worldwide. In resource-limited oncology settings, where patients are often low-literate, speak non-dominant languages, and manage complex multi-drug regimens, this problem is acute and largely unaddressed. We present Aakhyan, a vernacular patient communication platform that addresses the full post-discharge arc: from converting English-language discharge summaries into structured, voice-based vernacular explanations, through medication adherence support, to proactive follow-up management - all delivered via WhatsApp. The architecture is novel in its strict separation of concerns: a vision-language model performs structured JSON extraction from discharge images; all patient-facing content is generated deterministically from clinician-approved templates with community-sensitive vocabulary registers. This design eliminates the hallucination risk inherent in generative AI patient communication (documented at 18-82% in prior studies) while preserving the extraction capability of large language models. The platform supports four language registers, Bengali, Hindi, simplified English for tribal populations, and Assamese, with text-to-speech synthesis across all registers, including a custom grapheme-to-phoneme engine developed for Assamese phonology. Beyond discharge communication, the platform includes scheduled medication adherence nudges, interactive follow-up reminders, and a Daily Availability and Patient Notification System (DAPNS) that notifies patients the evening before their follow-up whether their doctor and required investigations are available, preventing wasted trips by rural patients who travel 2-6 hours to reach the centre. 
A 100-patient stratified randomised controlled study is planned at Silchar Cancer Centre, with structured teach-back assessment at 48-72 hours post-discharge as the primary comprehension outcome and preliminary clinical efficacy as a secondary objective. This paper describes the clinical rationale, technical architecture, safety framework, and positioning of Aakhyan within the existing literature on mHealth patient communication interventions.
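
The separation of concerns described in this abstract (model-based extraction into structured fields, then deterministic generation from approved templates) can be sketched as follows. The template wording, field names, and example values are hypothetical; the abstract does not show the platform's actual templates or schema.

```python
from string import Template

# Hypothetical clinician-approved template; not taken from the platform.
APPROVED_TEMPLATE = Template(
    "Take $drug $dose, $times_per_day time(s) a day for $days days."
)
REQUIRED_FIELDS = ("drug", "dose", "times_per_day", "days")


def render_instruction(extracted: dict) -> str:
    """Fill the fixed template from extracted fields; no free-text generation."""
    missing = [f for f in REQUIRED_FIELDS if f not in extracted]
    if missing:
        # Refuse to guess: fail loudly instead of producing a partial message.
        raise ValueError(f"missing fields: {missing}")
    return APPROVED_TEMPLATE.substitute({f: extracted[f] for f in REQUIRED_FIELDS})


# Hypothetical output of the extraction step.
extracted = {"drug": "ondansetron", "dose": "8 mg", "times_per_day": 2, "days": 3}
message = render_instruction(extracted)
```

Because the patient-facing sentence comes only from the fixed template, the extraction model can fail or omit a field but cannot introduce new wording.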

8

Methodological and Clinical Validation of TholdStormDX v0.0.1: An Advanced Stochastic Engine for the Optimization of Thresholds and Multimarker Panels Applied to Oncology

Reinosa, R.

2026-04-27 oncology 10.64898/2026.04.24.26351692 medRxiv

Top 0.1%

8.2%

Introduction: The translation of biomarkers into binary clinical decisions requires the determination of precise cut-off points. This study validates the TholdStormDX v0.0.1 tool, a mathematical engine that employs Dual Annealing, 2- and 4-parameter logistic fitting, and vectorized Monte Carlo simulations for panel optimization under Boolean OR logic. Methods: The tool was evaluated using datasets from four diagnostic domains (Pulmonary Nodules, Hepatocellular Carcinoma [HCC], Cervical Cancer, and Breast Cancer), along with a prognosis-oriented analytical context (Breast Cancer). Validation followed a strict workflow: characterization and selection of the best individual and combined thresholds in the Training (Train) and Validation (Val) sets, using the Test set in a completely independent manner, solely to assess the model's performance and generalizability. Results: The tool enabled precise derivation of cut-off points for both individual biomarkers and multivariable combinations. Evaluation on the Test set objectively demonstrated in which scenarios a single biomarker outperforms a complex panel, promoting clinical parsimony. For example, in Breast Cancer diagnosis, an individual predictor outperformed the optimized panel (Sensitivity: 0.953 / Specificity: 0.952 in Test); conversely, in Hepatocellular Carcinoma, the multivariable combination showed superior performance compared to the single marker (Sens: 0.707 / Spe: 0.718 in Test). Additionally, the self-auditing system effectively flagged metric degradation when noisy variables were included, preventing potential issues. Conclusion: TholdStormDX v0.0.1 proves to be a robust and transparent bioinformatics platform for deriving clinical thresholds. Its main contribution lies in mitigating local minima and promoting clinical parsimony, enabling researchers to objectively identify when a single biomarker is sufficient and when a panel provides real added value. 
Furthermore, it transforms the problem of biological noise into a safety feature: by systematically warning about algorithmic instability, it prevents overfitting and ensures the clinical viability of medical decisions. Availability: The software is free and distributed under the GNU GPLv3 license. TholdStormDX v0.0.1 is written in Python, and its source code is available at the following GitHub address: https://github.com/roberto117343/TholdStormDX.
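
To make "panel optimization under Boolean OR logic" concrete, the sketch below scores a two-marker OR rule by sensitivity and specificity and picks thresholds by exhaustive grid search. The grid search is a plain stand-in for the tool's Dual Annealing, Youden's J is an assumed objective, and the data are hypothetical; none of this is taken from the TholdStormDX source code.

```python
from itertools import product

# Hypothetical (marker_1, marker_2, has_disease) rows; illustrative only.
samples = [
    (0.9, 0.1, 1), (0.2, 0.8, 1), (0.7, 0.6, 1),
    (0.1, 0.2, 0), (0.3, 0.1, 0), (0.2, 0.3, 0),
]


def sens_spec(t1: float, t2: float) -> tuple[float, float]:
    """Sensitivity and specificity of the rule: m1 >= t1 OR m2 >= t2."""
    tp = fn = tn = fp = 0
    for m1, m2, label in samples:
        positive = m1 >= t1 or m2 >= t2
        if label:
            tp += positive
            fn += not positive
        else:
            fp += positive
            tn += not positive
    return tp / (tp + fn), tn / (tn + fp)


def youden_j(t1: float, t2: float) -> float:
    sens, spec = sens_spec(t1, t2)
    return sens + spec - 1.0


# Exhaustive grid over observed values (stand-in for a stochastic optimizer).
grid_1 = sorted({s[0] for s in samples})
grid_2 = sorted({s[1] for s in samples})
best_t1, best_t2 = max(product(grid_1, grid_2), key=lambda t: youden_j(*t))
```

In this toy data neither marker alone catches every case, while the OR rule does; on real data the comparison can go either way, which is the parsimony question the abstract raises.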

9

A Conversational Artificial Intelligence Framework for Comparative Pathway-Level Profiling of Sezary Syndrome and Primary Cutaneous CD8+ Aggressive Epidermotropic Cytotoxic T-Cell Lymphoma (PCAECTCL)

Diaz, F. C.; Waldrup, B.; Carranza, F. G.; Manjarrez, S.; Velazquez-Villarreal, E.

2026-04-17 oncology 10.64898/2026.04.15.26350992 medRxiv

Top 0.1%

6.9%

Background: Sezary syndrome (SS) is an aggressive leukemic variant of cutaneous T-cell lymphoma (CTCL) with distinct clinical and biological features compared to rarer entities such as primary cutaneous CD8+ aggressive epidermotropic cytotoxic T-cell lymphoma (PCAECTCL). Although recurrent genomic alterations in CTCL have been described, comparative analyses at the pathway level across biologically divergent subtypes remain limited. Here, we leveraged a conversational artificial intelligence (AI) platform for precision oncology to enable rapid, integrative, and hypothesis-driven interrogation of publicly available genomic datasets. Methods: We conducted a secondary analysis of somatic mutation and clinical data from the Columbia University CTCL cohort accessed via cBioPortal. Cases were stratified into SS (n=26) and PCAECTCL (n=13). High-confidence coding variants were curated and mapped to biologically relevant signaling pathways and functional gene categories implicated in CTCL pathogenesis. Pathway-level mutation frequencies were compared using Chi-square or Fisher's exact tests, with effect sizes quantified as odds ratios. Tumor mutational burden (TMB) was compared using the Wilcoxon rank-sum test. Subtype-specific co-mutation patterns were evaluated using pairwise association analyses and visualized through oncoplots and network heatmaps. Conversational AI agents, AI-HOPE, were used to iteratively refine cohort definitions, prioritize pathway-level signals, and contextualize findings. Results: TMB was comparable between SS and PCAECTCL (p = 0.96), indicating no significant difference in global mutational load. In contrast, pathway-centric analyses revealed marked qualitative differences. SS demonstrated enrichment of alterations in epigenetic regulators, tumor suppressor and cell-cycle control pathways, NFAT signaling, and DNA damage response mechanisms, consistent with transcriptional dysregulation and immune modulation. 
PCAECTCL exhibited relatively higher frequencies of alterations involving epigenetic regulators and MAPK pathway signaling, suggesting distinct oncogenic dependencies. Co-mutation analysis revealed a more constrained and focused interaction landscape in SS, whereas PCAECTCL displayed broader and more heterogeneous co-mutation networks, indicative of divergent evolutionary trajectories. Notably, the frequency of ERBB2 mutations differed significantly between subtypes (p = 0.031), highlighting a potential subtype-specific therapeutic vulnerability. Conclusions: This study demonstrates that SS is distinguished from PCAECTCL not by increased mutational burden but by distinct pathway-level architectures, particularly involving epigenetic regulation, immune signaling, and transcriptional control. These findings generate biologically grounded, testable hypotheses for subtype-specific therapeutic targeting and underscore the value of conversational AI as a scalable framework for accelerating discovery in translational cancer genomics.
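
The Methods above compare mutation frequencies with Fisher's exact test and odds ratios. A minimal pure-Python sketch of both is given below. The group sizes (26 and 13) come from the abstract, but the mutated counts are hypothetical, since per-pathway counts are not reported there.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k in the top-left cell, margins fixed.
        return comb(row1, k) * comb(row2, col1 - k) / total

    p_obs = prob(a)
    k_min, k_max = max(0, col1 - row2), min(row1, col1)
    probs = [prob(k) for k in range(k_min, k_max + 1)]
    # Sum every table that is no more likely than the observed one.
    return sum(p for p in probs if p <= p_obs * (1 + 1e-9))


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    return (a * d) / (b * c)


# Hypothetical counts: rows = SS (n=26), PCAECTCL (n=13);
# columns = mutated, wild-type.
table = (2, 24, 5, 8)
p_value = fisher_exact_two_sided(*table)
effect = odds_ratio(*table)
```

With group sizes this small, the exact test is the safer choice over Chi-square whenever expected cell counts are low.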

10

Multi-Task Learning and Soft-Label Supervision for Psychosocial Burden Profiling in Cancer Peer-Support Text

Wang, Z.; Cao, Y.; Shen, X.; Ding, Z.; Liu, Y.; Zhang, Y.

2026-04-04 health informatics 10.64898/2026.04.03.26350034 medRxiv

Top 0.1%

6.8%

Objective: Online cancer peer-support text contains signals of psychosocial burden beyond emotional tone, including treatment burden, financial strain, uncertainty, and unmet support needs. We evaluated 2 modeling extensions: multi-task learning (MTL) for joint prediction of health economics and outcomes research (HEOR) burden dimensions, and soft-label supervision using large language model (LLM)-derived probability distributions. Materials and Methods: We analyzed 10,392 cancer peer-support posts. GPT-4o-mini generated proxy annotations for HEOR burden subscales, composite burden, high-need status, speaker role, cancer type, and emotion probabilities. Study 1 trained a shared ALBERT encoder under 4 MTL conditions: composite and subscale burden targets, each with and without auxiliary heads, using Kendall uncertainty weighting. Study 2 compared soft-label training on LLM emotion distributions with hard-label baselines under regular and token-augmented inputs, evaluating performance against both human labels and AI distributions. Results: Composite-only MTL achieved R2=0.446 for burden regression and weighted F1=0.810 for high-need screening; subscale classification achieved mean weighted F1=0.646. Adding auxiliary role and cancer-type heads reduced regression performance (ΔR2 = -0.209). Soft-label training reduced weighted F1 by 0.16 versus hard-label baselines (0.68 vs. 0.86), and token augmentation did not improve performance under soft supervision. Discussion: Composite-only MTL supported modeling of multidimensional burden-related signals from forum text, whereas auxiliary prediction heads appeared to compete with primary tasks. Soft-label training aligned poorly with human-labeled emotion categories, suggesting that uncalibrated LLM distributions may propagate bias rather than improve supervision. Conclusion: Composite-only MTL was the strongest burden-modeling approach, and hard-label supervision remained preferable for emotion classification.
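
The difference between hard-label and soft-label supervision comes down to the target distribution used in the cross-entropy loss. The sketch below shows both on a hypothetical 3-class example; the probabilities are illustrative and the preprint's actual loss implementation is not shown in the abstract.

```python
import math


def cross_entropy(target: list[float], predicted: list[float]) -> float:
    """Cross-entropy H(target, predicted) = -sum_k target_k * log(predicted_k)."""
    return -sum(t * math.log(p) for t, p in zip(target, predicted) if t > 0)


# Hypothetical 3-class emotion example.
predicted = [0.7, 0.2, 0.1]    # model's softmax output
hard_label = [1.0, 0.0, 0.0]   # one-hot human label
soft_label = [0.5, 0.4, 0.1]   # LLM-derived probability distribution

hard_loss = cross_entropy(hard_label, predicted)  # reduces to -log(0.7)
soft_loss = cross_entropy(soft_label, predicted)
```

With a soft target, the model is also penalized for not matching the probability mass the LLM placed on other classes, so a poorly calibrated LLM distribution pulls predictions away from the human label.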

11

Leveraging Uncertainty Estimates for Drug Response Prediction in Cancer Cell Lines

Iversen, P.; Renard, B. Y.; Baum, K.

2026-04-06 bioinformatics 10.64898/2026.04.03.715851 medRxiv

Top 0.1%

6.5%

Motivation: Machine learning models that predict drug response from cancer cell line omics profiles could advance precision oncology, yet their utility is limited by heterogeneous prediction quality and silent failures under distribution shifts. Uncertainty quantification can address these challenges, but systematic evaluation of methods for this domain is lacking. Results: We benchmark seven uncertainty-aware models for drug response prediction, comparing epistemic uncertainty via ensemble disagreement, aleatoric uncertainty via distributional modeling, and their combination. Gaussian neural network ensembles reliably flag out-of-distribution inputs and achieve a 64% reduction in mean squared error when filtering to the 10% most confident predictions. We discuss how probabilistic predictions can enable drug candidate analyses that account for therapeutically relevant response ranges. Through uncertainty attribution, we identify transcriptomic signatures of unpredictability, i.e., genes associated with prediction uncertainty. We also demonstrate that uncertainty-guided active learning can prioritize informative experiments. Availability and Implementation: The code and data are available at https://github.com/PascalIversen/LUDRP and https://zenodo.org/records/19219091. Contact: katharina.baum@fu-berlin.de
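
For an ensemble whose members each output a Gaussian mean and variance, a common way to separate the two uncertainty types is the law of total variance: the average predicted variance is read as aleatoric and the variance of the member means as epistemic. The sketch below assumes that decomposition, which may differ from the benchmarked models; the function names and numbers are hypothetical.

```python
from statistics import fmean, pvariance


def decompose(means: list[float], variances: list[float]) -> tuple[float, float, float]:
    """Return (prediction, aleatoric, epistemic) for one input.

    Each ensemble member outputs a Gaussian (mean, variance).
    """
    prediction = fmean(means)
    aleatoric = fmean(variances)   # average predicted noise
    epistemic = pvariance(means)   # disagreement between members
    return prediction, aleatoric, epistemic


def most_confident(items: list[tuple[str, float]], fraction: float) -> list[str]:
    """Keep the `fraction` of items with the lowest total uncertainty."""
    keep = max(1, int(len(items) * fraction))
    return [name for name, _ in sorted(items, key=lambda x: x[1])[:keep]]


# Hypothetical 3-member ensemble output for one cell line / drug pair.
pred, alea, epi = decompose([1.0, 2.0, 3.0], [0.5, 0.5, 0.5])
total = alea + epi
```

Calling `most_confident` with `fraction=0.1` mirrors the abstract's filter to the 10% most confident predictions.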

12

Backfill Bayesian Ordered Lattice Design for Phase I Clinical Trials

WANG, G.-M.; Tatsuoka, C.

2026-04-06 oncology 10.64898/2026.04.02.26350086 medRxiv

Top 0.1%

6.4%

The Bayesian Ordered Lattice Design (BOLD) method for Phase I clinical trials is extended to address an important challenge. It is widely understood that conventional Phase I trial designs are not consistently effective in determining safe and active dose levels. The US FDA launched Project Optimus, aimed at reforming the paradigms of dose optimization and selection. We propose a backfill BOLD design (BF-BOLD) that centers on BOLD for dose-finding but also adds an activity evaluation for each patient. Our method for determining the optimal biological dose (OBD) first involves identifying the maximum tolerated dose (MTD) and then assessing activity rates among dose levels below the identified MTD. This approach is straightforward and does not require complex statistical modeling. The simulation results indicate that performing dose-finding trials with backfilling can enhance both safety and activity assessment, thereby improving treatment sustainability while also preserving the potential for efficacy of the Recommended Phase II Dose (RP2D). We also demonstrate the applicability of the backfill design for reducing overdose rates, and as a more attractive alternative to small-scale dose expansion trials that follow dose escalation. Backfill designs are an important design approach for early phase trials.

13

CohortContrast: An R Package for Enrichment-Based Identification of Clinically Relevant Concepts in OMOP CDM Data

Haug, M.; Ilves, N.; Umov, N.; Loorents, H.; Suvalov, H.; Tamm, S.; Oja, M.; Reisberg, S.; Vilo, J.; Kolde, R.

2026-04-23 health informatics 10.64898/2026.04.22.26351461 medRxiv

Top 0.1%

6.3%

Objective: To address the unresolved bottleneck of selecting cohort-relevant clinical concepts for treatment trajectory analysis in observational health data, we introduce CohortContrast, an OMOP-compatible R package for enrichment-based concept identification, temporal and semantic noise reduction, and concept aggregation, enabling cohort-level characterization and downstream trajectory analysis. Materials and Methods: We developed CohortContrast and applied it to OMOP-mapped observational data from the Estonian nationwide OPTIMA database, which includes all cases of lung, breast, and prostate cancer, focusing here on lung and prostate cancer cohorts. The workflow combines target-control statistical enrichment, temporal/global noise filtering, hierarchical concept aggregation and correlation-based merging, with optional patient clustering for downstream trajectory exploration. We validated the approach with a clinician-based plausibility assessment of extracted diagnosis-concept pairs and evaluated a large language model (LLM) as an auxiliary filtering step. Results: We analyzed 7,579 lung cancer and 11,547 prostate cancer patients. The workflow reduced concept dimensionality from 5,793 to 296 concepts (94.9%) in lung cancer and from 5,759 to 170 concepts (97.0%) in prostate cancer, and identified three exploratory patient subgroups in both cohorts. In a plausibility assessment of 466 diagnosis-concept pairs, validators rated 31.3% as directly linked and 57.5% as indirectly linked. Discussion: CohortContrast reduces manual concept curation by prioritizing and aggregating cohort-relevant concepts while preserving clinically interpretable treatment patterns in OMOP-based real-world data. Conclusion: CohortContrast enables scalable reduction of broad OMOP concept spaces into clinically interpretable, cohort-specific representations for exploratory trajectory analysis and real-world evidence research.
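
The dimensionality-reduction percentages quoted above follow directly from the before and after concept counts. A short check in Python (the function name is hypothetical; the package itself is written in R):

```python
def percent_reduction(before: int, after: int) -> float:
    """Percentage of concepts removed, rounded to one decimal place."""
    return round(100.0 * (before - after) / before, 1)


lung = percent_reduction(5793, 296)
prostate = percent_reduction(5759, 170)
print(lung, prostate)
```

Both values reproduce the figures stated in the abstract (94.9% and 97.0%).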

14

Pan-cancer survival modeling reveals structural limits of genomic feature integration in immunotherapy outcomes

Hassan, W.; Adeleke, S.

2026-04-18 bioinformatics 10.64898/2026.04.15.718634 medRxiv

Top 0.1%

6.3%

Background: Immune checkpoint inhibitors (ICIs) have improved outcomes across multiple cancer types, yet reliable predictors of survival remain limited. While genomic features such as tumor mutational burden (TMB) are widely used, their contribution to predictive modeling in heterogeneous real-world cohorts remains unclear. We evaluated the relative contributions of clinical and whole-genome sequencing (WGS) features in pan-cancer survival modeling. Methods: We analyzed 658 patients treated with ICIs with matched WGS data from Genomics England. Using a leakage-controlled machine learning framework with strict train-test separation, we compared four models: TMB-only, clinical-only, clinical+TMB, and an integrated 11-feature clinico-genomic XGBoost survival model. Model performance was assessed using Harrell's concordance index (C-index) with bootstrap confidence intervals. Results: TMB alone demonstrated near-random discrimination (C-index 0.50; 95% CI 0.44-0.56). Clinical variables substantially improved predictive performance (0.59; 95% CI 0.53-0.64), with marginal gain from adding TMB (0.59). The integrated model achieved a C-index of 0.60 (95% CI 0.55-0.65). While improvement over TMB alone was significant, incremental gain beyond optimized clinical models was modest. Feature attribution analysis showed that model performance was dominated by clinical variables, with genomic features contributing limited additional signal. Conclusions: These findings suggest that, in heterogeneous pan-cancer cohorts, predictive performance is constrained by the underlying data structure, in which dominant clinical signals overshadow genome-scale features. This study highlights fundamental limitations in integrating genomic data into survival models across diverse cancer types and provides a benchmark for future computational approaches.
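
Harrell's C-index, the metric reported above, is the fraction of comparable patient pairs in which the patient who failed earlier was assigned the higher risk. The sketch below is a simplified pure-Python version that ignores tied event times; the follow-up times, event flags, and risk scores are hypothetical.

```python
def harrell_c(times: list[float], events: list[int], risks: list[float]) -> float:
    """Harrell's concordance index for right-censored data.

    A pair (i, j) is comparable when i had an event and times[i] < times[j].
    It is concordant when the earlier failure has the higher risk score.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        if not events[i]:
            continue  # censored subjects cannot anchor a comparable pair
        for j in range(n):
            if times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5  # tied scores count as half
    if comparable == 0:
        raise ValueError("no comparable pairs")
    return concordant / comparable


# Hypothetical follow-up times (days), event flags, and model risk scores.
times = [100, 200, 300, 400]
events = [1, 1, 0, 1]
perfect = harrell_c(times, events, [4.0, 3.0, 2.0, 1.0])
constant = harrell_c(times, events, [1.0, 1.0, 1.0, 1.0])
```

A model that assigns every patient the same score lands at 0.5, which is why the TMB-only result of 0.50 is described as near-random.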

15

Evaluating the Large Language Model-Based Quality Assurance Tool for Auto-Contouring

Tozuka, R.; Akita, T.; Matsuda, M.; Tanno, H.; Saito, M.; Nemoto, H.; Mitsuda, K.; Kadoya, N.; Jingu, K.; Onishi, H.

2026-04-01 radiology and imaging 10.64898/2026.03.31.26349802 medRxiv

Top 0.1%

5.0%

Purpose: Manual verification of AI-based auto-contouring is labor-intensive and prone to fatigue-related errors. This study developed the large language model (LLM)-based automated Quality Assurance (QA) for auto-contouring (LAQUA) system using a multimodal LLM, Gemini 2.5 Pro, and evaluated its feasibility as a clinical primary screening tool to streamline the QA workflow. Methods: Twenty male pelvic CT scans from an open dataset were utilized. Three distinct auto-contouring software packages (OncoStudio, RatoGuide prototype and syngo.via) were evaluated. Auto-contouring results for each slice were exported as PDF images with overlaid contours and input into Gemini 2.5 Pro. The LLM was instructed to rate the contour quality on a 5-point clinical scale (5: Optimal; 4: Acceptable; 3: Suboptimal; 2: Unacceptable, redraw from scratch; 1: Unacceptable, organ not detected). Using evaluations by two board-certified radiation oncologists as ground truth, Spearman's rank correlation coefficients (ρ) and weighted kappa coefficients (κ) were calculated. Additionally, to assess screening performance, sensitivity and specificity were calculated by dichotomizing the scores into "Pass" and "Fail" using two different cutoffs (scores ≥3 and ≥4 as "Pass"). Finally, the alignment of the rationales provided by the LLM with the auto-contouring quality was evaluated by two board-certified radiation oncologists. This was conducted using a Likert scale assessing four domains (error detection, hallucination, clinical relevance, and anatomical understanding), each scored out of 2 points. Results: The LAQUA system demonstrated moderate to strong agreement with expert judgments across all evaluated organs (ρ: 0.567-0.835; quadratic weighted κ: 0.639-0.804), with the rectum showing the highest correlation. Regarding screening performance, a cutoff of ≥3 as "Pass" achieved the highest sensitivity and specificity in specific subgroups, but with wide 95% confidence intervals (CIs). A cutoff of ≥4 as "Pass" narrowed the CIs, yielding the highest sensitivity in the rectum (0.976) and the highest specificity in the left femoral head (0.933). Qualitatively, the LLM's rationales achieved an overall mean score of 1.70 ± 0.48 (out of 2), with 155 of 291 outputs receiving perfect scores across all criteria. Conclusions: The LAQUA system demonstrated substantial agreement with expert evaluations in AI-based auto-contouring quality assessment. While potential overestimation bias (risk of missing "Fail" cases) warrants caution, the observed sensitivity suggests its feasibility as a primary screening QA tool to efficiently filter acceptable contours, thereby reducing the clinical workload.
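The two agreement analyses described in this abstract can be sketched as follows: quadratic weighted kappa on the 5-point scale, and sensitivity and specificity after dichotomizing at a cutoff. The function names are hypothetical, and treating "Pass" as the positive class is an assumption that the abstract does not state explicitly.

```python
def quadratic_weighted_kappa(rater_a, rater_b, min_score=1, max_score=5):
    """Cohen's kappa with quadratic disagreement weights w = (i - j)^2 / (k - 1)^2."""
    k = max_score - min_score + 1
    n = len(rater_a)
    observed = [[0] * k for _ in range(k)]
    for a, b in zip(rater_a, rater_b):
        observed[a - min_score][b - min_score] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(k)) for j in range(k)]
    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(k):
        for j in range(k):
            weight = (i - j) ** 2 / (k - 1) ** 2
            weighted_observed += weight * observed[i][j]
            # Expected count under independence of the two raters.
            weighted_expected += weight * row_totals[i] * col_totals[j] / n
    if weighted_expected == 0:
        return 1.0  # both raters gave one identical score throughout
    return 1.0 - weighted_observed / weighted_expected


def pass_fail_metrics(expert, model, cutoff):
    """Sensitivity and specificity with score >= cutoff as 'Pass' (assumed positive class)."""
    tp = fp = tn = fn = 0
    for e, m in zip(expert, model):
        expert_pass, model_pass = e >= cutoff, m >= cutoff
        if expert_pass and model_pass:
            tp += 1
        elif expert_pass:
            fn += 1
        elif model_pass:
            fp += 1
        else:
            tn += 1
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity
```

Under this convention, a model that over-rates poor contours produces false positives and lowers specificity, which corresponds to the overestimation bias the authors caution about.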

16

Pneumonia Detection in Paediatric Chest X-Rays using Ensembled Large Language Models

Tan, J.; Tang, P. H.

2026-04-12 radiology and imaging 10.64898/2026.04.10.26347909 medRxiv

Top 0.1%

5.0%

Background: Paediatric pneumonia is a leading cause of childhood morbidity and mortality worldwide. Chest X-rays (CXRs) are an important tool in the diagnosis of pneumonia, but shortages in specialist radiology services lead to clinically significant delays in CXR reporting. The ability to communicate findings both to clinicians and laypersons allows multimodal large language models (MLLMs) to be deployed throughout clinical workflows, from image analysis to patient communication. However, MLLMs currently underperform state-of-the-art deep learning classifiers. Objective: To evaluate the diagnostic accuracy of ensemble strategies with MLLMs compared to the baseline average agent for paediatric radiological pneumonia detection. Methods: We conducted a retrospective cohort study using paediatric CXRs from two independent hospital datasets totalling 2300 CXRs. Fifteen MedGemma-4B-it agents independently classified each CXR into five pneumonia likelihood categories. Majority voting, soft voting, and GPTOSS-20B aggregation were compared against the average agent performance. The primary metric evaluated was one-vs-rest (OvR) AUROC. Secondary metrics included accuracy, sensitivity, specificity, F1-score, Cohen's kappa, and one-vs-one (OvO) AUROC. Results: Soft voting achieved improvements in OvR AUROC (p_balanced = 0.0002, p_real-world = 0.0003), accuracy (p_balanced = 0.0008, p_real-world < 0.0001), Cohen's kappa (p_balanced = 0.0006, p_real-world = 0.0054) and OvO AUROC (p_balanced < 0.0001, p_real-world = 0.0011) across both datasets, and a superior F1-score (p_balanced = 0.0028) for the balanced dataset. Conclusion: Soft voting enhances MedGemma's diagnostic discriminatory performance for paediatric radiological pneumonia detection. Our system enables privacy-preserving, near real-time clinical decision support with explainable outputs, having potential for integration into emergency departments. Our system's high specificity supports triage by flagging high-risk radiological pneumonia cases.
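The difference between the majority-voting and soft-voting strategies compared in this abstract can be shown with a small sketch. It assumes each agent emits a probability vector over the five likelihood categories; the abstract does not say how agent outputs are represented, so that representation, the function names, and the tie-breaking rule are illustrative only.

```python
from collections import Counter


def argmax(values):
    """Index of the largest value; the lowest index wins ties."""
    return max(range(len(values)), key=lambda c: values[c])


def majority_vote(agent_probs):
    """Each agent casts one vote for its top category; the most common vote wins."""
    counts = Counter(argmax(row) for row in agent_probs)
    return max(sorted(counts), key=lambda c: counts[c])  # lowest category on ties


def soft_vote_probs(agent_probs):
    """Average the agents' probability vectors category by category."""
    n_agents = len(agent_probs)
    n_classes = len(agent_probs[0])
    return [sum(row[c] for row in agent_probs) / n_agents for c in range(n_classes)]


def soft_vote(agent_probs):
    """Pick the category with the highest mean probability."""
    return argmax(soft_vote_probs(agent_probs))


# Toy example with three hypothetical agents and five categories (indices 0-4).
agents = [
    [0.05, 0.10, 0.40, 0.35, 0.10],
    [0.05, 0.10, 0.40, 0.35, 0.10],
    [0.00, 0.00, 0.05, 0.15, 0.80],
]
print(majority_vote(agents), soft_vote(agents))
```

In the toy data, two weakly confident agents outvote one strongly confident agent under majority voting, while soft voting lets the confident agent's probability mass decide. The averaged vector is also what an AUROC computation would consume, since AUROC needs scores rather than hard labels.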

17

Generalizable Deep Learning Framework for Radiotherapy Dose Prediction Across Cancer Sites, Prescriptions and Treatment Modalities

Chang, H.-h.; Cardan, R.; Nedunoori, R.; Fiveash, J.; Popple, R.; Bodduluri, S.; Stanley, D. N.; Harms, J.; Cardenas, C.

2026-04-22 radiology and imaging 10.64898/2026.04.17.26350770 medRxiv

Top 0.1%

4.9%

Optimizing radiotherapy dose distributions remains a resource-intensive bottleneck. Existing AI-based dose prediction methods often have limited generalizability because they rely on small, heterogeneous datasets. We present nnDoseNetv2, an auto-configured, end-to-end framework for dose prediction across diverse disease sites (head and neck, prostate, breast, and lung), prescription levels (1.5-84 Gy), and treatment modalities (IMRT, VMAT, and 3D-CRT). By integrating machine-specific beam geometry with 3D structural information, the framework is designed to generalize across varied clinical scenarios. A single multi-site model was trained on 1,000 clinical plans. On sites seen during training, performance was comparable to specialized site-specific models. On unseen sites (liver and whole brain), the model outperformed site-specific models, with mean absolute errors of 2.46% and 6.97% of prescription, respectively. These results suggest that geometric awareness can bridge disparate anatomical domains while eliminating the need for site-specific model maintenance, providing a scalable and high-fidelity approach for personalized radiotherapy planning.

18

MedSafe-Dx (v0): A Safety-Focused Benchmark for Evaluating LLMs in Clinical Diagnostic Decision Support

Van Oyen, C.; Mirza-Haq, N.

2026-04-21 health informatics 10.64898/2026.04.14.26350711 medRxiv

Top 0.2%

4.4%

MedSafe-Dx (v0) is a new safety-focused benchmark for evaluating large language models in clinical diagnostic decision support using a filtered subset of the DDx Plus dataset (N=250). MedSafe-Dx evaluates three dimensions: escalation sensitivity, avoidance of false reassurance, and calibration of uncertainty. Models were tasked with providing a ranked differential (ICD-10), an escalation decision (Urgent vs. Routine), and a confidence flag. Performance was measured via a "Safety Pass Rate," a composite metric penalizing three hard failure modes: missed escalations of life-threatening conditions, overconfident incorrect diagnoses, and unsafe reassurance in ambiguous cases. Evaluation of eleven models revealed a significant disconnect between diagnostic recall and safety. GPT-5.2 achieved the highest Safety Pass Rate (97.6%), while several models exhibited high rates of missed escalations or unsafe reassurance. MedSafe-Dx provides a robust stress test for identifying high-risk failure modes in diagnostic decision support and shows that high diagnostic accuracy does not guarantee clinical safety. While the benchmark is currently limited by synthetic data and proxy labels, it provides a reproducible, auditable framework for testing AI behavior before clinical deployment. Our findings suggest that interventions such as safety-focused prompting and reasoning-token budgets could be essential components for the safe deployment of LLMs in clinical workflows.
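One plausible reading of a composite pass rate over three hard failure modes is sketched below. The abstract does not give the exact scoring rule, so the all-or-nothing definition (a case passes only if it triggers none of the three failures), the `CaseResult` type, and its field names are assumptions for illustration rather than the benchmark's actual implementation.

```python
from dataclasses import dataclass


@dataclass
class CaseResult:
    """Hypothetical per-case flags, one per hard failure mode named in the abstract."""
    missed_escalation: bool      # life-threatening condition not escalated
    overconfident_wrong: bool    # incorrect diagnosis flagged as confident
    unsafe_reassurance: bool     # reassurance given in an ambiguous case


def safety_pass_rate(results):
    """Share of cases with no hard failure (assumed all-or-nothing rule)."""
    if not results:
        raise ValueError("no cases to score")
    passed = sum(
        1 for r in results
        if not (r.missed_escalation or r.overconfident_wrong or r.unsafe_reassurance)
    )
    return passed / len(results)


# Toy run: 250 cases, 6 of which trip at least one failure mode.
clean = [CaseResult(False, False, False)] * 244
failed = [CaseResult(True, False, False)] * 3 + [CaseResult(False, False, True)] * 3
print(safety_pass_rate(clean + failed))
```

Under this assumed rule, a rate of 97.6% on N=250 would correspond to 244 passing cases, though the benchmark's real aggregation may differ.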

19

Medicalbench: Evaluating Large Language Models Towards Improved Medical Concept Extraction

Yang, Z.; Lyng, G. D.; Batra, S. S.; Tillman, R. E.

2026-04-16 health informatics 10.64898/2026.04.12.26350704 medRxiv

Top 0.2%

4.4%

Medical concept extraction from electronic health records underpins many downstream applications, yet remains challenging because medically meaningful concepts, such as diagnoses, are frequently implied rather than explicitly stated in medical narratives. Existing benchmarks with human-annotated evidence spans underscore the importance of grounding extracted concepts in medical text. However, they predominantly focus on explicitly stated concepts and provide limited coverage of cases in which medically relevant concepts must be inferred. We present MedicalBench, a new benchmark for medical concept extraction with evidence grounding that evaluates implicit medical reasoning. MedicalBench formulates medical concept extraction as a verification task over medical note-concept pairs, coupled with sentence-level evidence identification. Built from MIMIC-IV discharge summaries and human-verified ICD-10 codes, the dataset is curated through a multi-stage large language model (LLM) triage pipeline followed by medical annotation and expert review. It deliberately includes implicit positives, semantically confusable negatives, and cases where LLM judgments disagree with medical expert assessments. Annotators provide sentence-level evidence spans and concise medical rationales. The final dataset contains 823 high-quality examples. We define two complementary evaluation tasks: (1) medical concept extraction and (2) sentence-level evidence retrieval, enabling assessment of both correctness and interpretability. Benchmarking state-of-the-art LLMs and a supervised baseline reveals that performance remains modest, highlighting the difficulty of extracting implicitly expressed concepts. We further show that explicitly incorporating reasoning cues and prompting to extract implicit evidence substantially improves medical concept extraction, while performance is largely invariant to note length, indicating that MedicalBench isolates reasoning difficulty rather than superficial confounders. MedicalBench provides the first systematic benchmark for implicit, evidence-grounded medical concept extraction, offering a foundation for developing medical language models that can both identify medically relevant concepts and justify their predictions in a transparent and medically faithful manner.

20

Perioperative Mortality Prediction Using a Bayesian Ensemble with Prevalence-Adaptive Gating

Pandey, A. K.

2026-04-06 health informatics 10.64898/2026.04.03.26350114 medRxiv

Top 0.2%

4.4%

Background: Perioperative mortality prediction in resource-limited surgical settings remains challenging due to class imbalance, missing data, and the heterogeneity of postoperative complications. Existing risk scores such as POSSUM depend on intraoperative variables and do not quantify prediction uncertainty. Methods: We developed a prevalence-adaptive Bayesian ensemble comprising three stochastic models: a classifier Variational Autoencoder (VAE, AUC=0.95), a Flipout Last Layer network (AUC=0.84), and a Monte Carlo Dropout network (AUC=0.80), trained on 697 patients (39 deaths, prevalence 5.59%) with 67 preoperative and postoperative features. Class imbalance (16.9:1) was addressed through Variational Autoencoder augmentation: two class-conditional generative VAEs produced 619 synthetic survivor and 619 synthetic death records, yielding a balanced training corpus of 1,935 samples. VAE augmentation was selected over SMOTE and random oversampling after a comparative study (F1: random oversampling 0.61 vs VAE augmentation 0.77). Validation used a held-out set of 233 patients (13 deaths, 220 survivors). A six-stage prediction pipeline incorporated weighted base risk, a three-path prevalence-adaptive gate, Shannon entropy uncertainty quantification, and rank-transform calibration. Sensitivity analysis was conducted across all six empirically derived hyperparameters. A whole-cohort death audit evaluated all 52 deaths from the complete 930-patient dataset through the deployed clinical decision support system. Statistical analysis included Kruskal-Wallis testing of entropy across triage groups, Wilson score confidence intervals for performance metrics, and Spearman rank correlation for LIME-SHAP interpretability concordance. Results: On the validation cohort the ensemble achieved complete separation (sensitivity 100%, specificity 100%, Youden J=1.000; TP=13, FP=0, TN=220, FN=0). The whole-cohort death audit identified 36 of 52 deaths (sensitivity 69.2%, 95% CI 55.7%-80.1%; precision 100%, 95% CI 90.4%-100.0%; F1=0.818, bootstrap 95% CI 0.732-0.894). Shannon entropy differed significantly across triage levels (Kruskal-Wallis H(2)=24.212, p<0.001, ε²=0.453), confirming a monotone gradient SAFE < CRITICAL < GRAY ZONE. All six hyperparameters were invariant across their tested ranges (J=1.000 throughout; Supplementary Tables S1-S2). LIME and SHAP rankings showed statistically significant concordance (Spearman ρ=0.440, p=0.024; Kendall τ=0.357, p=0.011), with 4 of 6 principal mortality determinants shared across both methods. Conclusions: A prevalence-adaptive Bayesian ensemble with entropy-based uncertainty triage achieves zero false positive alerts and clinically meaningful audit sensitivity in perioperative mortality prediction. Complete hyperparameter invariance confirms that reported performance reflects structural properties of the calibration architecture. The 16 missed deaths represent feature-invisible cases beyond current observational feature capacity.
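The audit figures in this abstract can be checked with a short sketch of the Wilson score interval and the F1 score. The function names and the z value for a two-sided 95% interval are choices made here; the counts (36 of 52 deaths detected, no false positives) come from the abstract.

```python
import math


def wilson_interval(successes, n, z=1.959964):
    """Wilson score interval for a binomial proportion (z = 1.959964 gives about 95%)."""
    if n == 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point overshoot at p = 0 or p = 1.
    return max(0.0, center - half_width), min(1.0, center + half_width)


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in terms of counts."""
    return 2 * tp / (2 * tp + fp + fn)


# Counts from the whole-cohort death audit: 36 detected, 16 missed, 0 false alerts.
sens_lo, sens_hi = wilson_interval(36, 52)   # sensitivity = 36 / 52
prec_lo, prec_hi = wilson_interval(36, 36)   # precision = 36 / 36
audit_f1 = f1_score(tp=36, fp=0, fn=16)
print(f"sensitivity CI: {sens_lo:.3f}-{sens_hi:.3f}")
print(f"precision CI:   {prec_lo:.3f}-{prec_hi:.3f}")
print(f"F1:             {audit_f1:.3f}")
```

These calculations reproduce the reported sensitivity interval (55.7%-80.1%), precision interval (90.4%-100.0%), and F1 of 0.818. Unlike the normal-approximation interval, the Wilson interval stays informative when the observed proportion is exactly 100%.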